
Corporation Websites Collection

This collection contains a web archive corpus of more than 0.8 million corporate websites (from an original list of ~0.98 million websites) extracted from the archive.org web archive, covering the period 1996 to early 2017. This corpus was originally created as a collaboration between the Internet Archive and a group at Dartmouth University, but it may be useful to other researchers.

Updated or more detailed information may exist at:

Corpus Statistics

  • approximately 840,000 domains
  • more than 500,000,000 unique URLs
  • more than 1,600 WARC files and 26,000 ARC files
  • more than 3.25 TB compressed

Content

This dataset contains a sample of up to 500 unique text/html URLs per year for each website. The sample records were selected based on the sort order of the URL strings, then extracted and packaged into (W)ARC files. The breakdown of the number of captures per year per domain is available here:
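The selection rule described above can be illustrated with a minimal sketch for a single website. The input shape (tuples of year, URL, and MIME type), the function name `sample_urls`, and the choice of ascending string order with the first entries kept are all assumptions made for illustration; the description does not specify the tooling actually used.

```python
from collections import defaultdict

MAX_PER_YEAR = 500  # cap stated in the collection description


def sample_urls(captures, limit=MAX_PER_YEAR):
    """Return {year: [urls]} with at most `limit` unique text/html URLs per year.

    `captures` is an iterable of (year, url, mime_type) tuples for one website.
    This input shape is hypothetical, not the corpus's real format.
    """
    by_year = defaultdict(set)
    for year, url, mime_type in captures:
        if mime_type == "text/html":
            by_year[year].add(url)  # the set drops duplicate URLs
    # Assumption: sort URLs as plain strings and keep the first `limit`.
    return {year: sorted(urls)[:limit] for year, urls in sorted(by_year.items())}


captures = [
    (2015, "http://example.com/b.html", "text/html"),
    (2015, "http://example.com/a.html", "text/html"),
    (2015, "http://example.com/a.html", "text/html"),  # duplicate capture
    (2015, "http://example.com/logo.png", "image/png"),  # not text/html
    (2016, "http://example.com/", "text/html"),
]
sample = sample_urls(captures, limit=1)
print(sample)
```

One consequence of sorting on the URL string is that the sample is deterministic but not random: for large sites, URLs that sort early are more likely to be included.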

How to Download

This large corpus is split into several hundred distinct "items" on archive.org.

We recommend using the internetarchive Python utility (a.k.a. "ia") for bulk downloads. See also:
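As a rough sketch of a bulk download with the internetarchive library: the collection identifier below is a placeholder (the real identifier is not given in this description), and the helper names, the `corpus` destination directory, and the `*arc.gz` file pattern are assumptions for illustration. It requires `pip install internetarchive`.

```python
import os


def collection_query(collection_id):
    """Build an archive.org search query matching every item in a collection."""
    return f"collection:{collection_id}"


def download_collection(collection_id, destdir="corpus", pattern="*arc.gz"):
    """Download files matching `pattern` from every item in the collection."""
    # Imported lazily so collection_query() works without the package installed.
    from internetarchive import download, search_items

    os.makedirs(destdir, exist_ok=True)
    for result in search_items(collection_query(collection_id)):
        # checksum=True skips files already present with a matching checksum,
        # which makes an interrupted bulk download cheaper to resume.
        download(
            result["identifier"],
            glob_pattern=pattern,  # assumed to match both .warc.gz and .arc.gz
            destdir=destdir,
            checksum=True,
            verbose=True,
        )


# Usage (placeholder identifier, replace with the real collection name):
# download_collection("EXAMPLE_COLLECTION_ID")
```

Given the corpus size (more than 3.25 TB compressed), it may be worth running this against a single item first to confirm the file pattern and available disk space.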

WARC and ARC downloads can be verified using the unified manifest files at:
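A minimal verification sketch follows. It assumes each manifest line has the form `<sha1 hex digest> <file name>`, as produced by tools such as `sha1sum`; the actual manifest layout is not specified in this description, so the parsing may need adjusting. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha1(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return a list of (file name, problem) for files that fail the check.

    Assumed manifest format: one "<sha1> <file name>" entry per line.
    """
    problems = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # drop sha1sum's binary-mode marker
        target = Path(root) / name
        if not target.is_file():
            problems.append((name, "missing"))
        elif file_sha1(target) != expected.lower():
            problems.append((name, "checksum mismatch"))
    return problems
```

An empty list from `verify_manifest("manifest.txt", root="corpus")` would mean every listed file is present and matches its recorded digest; both path names here are placeholders.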

Additional manifest and derived information (including CDX files) exist at:
